GraphBuilder: A Scalable Graph ETL Framework
Abstract
Graph abstraction is essential for many applications, from finding a shortest path to executing complex machine learning (ML) algorithms such as collaborative filtering. However, constructing graphs from relationships hidden within large unstructured datasets is challenging. Because graph construction is a data-parallel problem, MapReduce is well suited to the task. We developed GraphBuilder, an open source, scalable framework for graph Extract-Transform-Load (ETL), to offload many of the complexities of graph construction, including transformation, normalization, and partitioning. GraphBuilder is written in Java for ease of programming, and it scales using the MapReduce model. In this paper, we describe the motivation for GraphBuilder, its architecture, its MapReduce algorithms for graph processing, and a performance evaluation of the framework. Because large graphs should be partitioned across a cluster for storage and processing, and the choice of partitioning method has a significant impact on performance, we develop several graph partitioning methods and evaluate their performance.
Similar Resources
GraphBuilder – A Scalable Graph Construction Library for Apache™ Hadoop™
The exponential growth in the pursuit of knowledge gleaned from data relationships that are expressed naturally as large and complex graphs is fueling new parallel machine learning algorithms. The nature of these computations is iterative and data-dependent. Recently, frameworks have emerged to perform these computations in a distributed manner at commercial scale. But feeding data to these fra...
ETLMR: A Highly Scalable Dimensional ETL Framework Based on MapReduce
Extract-Transform-Load (ETL) flows periodically populate data warehouses (DWs) with data from different source systems. An increasing challenge for ETL flows is processing huge volumes of data quickly. MapReduce is establishing itself as the de-facto standard for large-scale data-intensive processing. However, MapReduce lacks support for high-level ETL specific constructs, resulting in low ETL ...
CloudETL: Scalable Dimensional ETL for Hadoop and Hive
Extract-Transform-Load (ETL) programs process data from sources into data warehouses (DWs). Due to the rapid growth of data volumes, there is an increasing demand for systems that can scale on demand. Recently, much attention has been given to MapReduce which is a framework for highly parallel handling of massive data sets in cloud environments. The MapReduce-based Hive has been proposed as a D...
Systematic ETL management - Experiences with high-level operators
Large organizations load much of their data into data warehouses for subsequent querying, analysis, and data mining. Extract-Transform-Load (ETL) workflows populate those data warehouses with data from various data sources by specifying and executing a set of transformations forming a directed acyclic transformation graph (DAG). Over time, hundreds of individual ETL workflows evolve as new sour...
Rule-Based Management of Schema Changes at ETL Sources
In this paper, we visit the problem of the management of inconsistencies emerging on ETL processes as results of evolution operations occurring at their sources. We abstract Extract-Transform-Load (ETL) activities as queries and sequences of views. ETL activities and its sources are uniformly modeled as a graph that is annotated with rules for the management of evolution events. Given a change ...
Publication date: 2013